Single-Sample cfDNA WGBS Analysis Report
Comprehensive cfDNA Fragmentomics and Methylation Analysis for ALS/Control Classification
This report presents a comprehensive analysis of a single cell-free DNA (cfDNA) sample from whole-genome bisulfite sequencing (WGBS) data. The analysis follows the same algorithms and methods used in the ALS vs Control cohort study to ensure reproducibility and comparability. Outputs include quality control metrics, insert size distributions, 5’ end motif profiling, and classification prediction using a pre-trained Random Forest model.
Keywords: cfDNA, WGBS, ALS, methylation, fragmentomics, classification, MethTile, Random Forest
Introduction
This report presents a representative analysis of a single cell-free DNA (cfDNA) sample from whole-genome bisulfite sequencing (WGBS) data. The analysis follows the same algorithms and methods used in the ALS vs Control study to ensure reproducibility and comparability.
Analysis Overview
The pipeline generates the following outputs:
- Quality Control Metrics: Mapping statistics, bisulfite conversion efficiency, and coverage metrics
- Fragment Size Analysis: Insert size distributions showing the characteristic cfDNA nucleosome pattern
- Fragmentation Ratio Track: Genome-wide short/long fragment ratio visualization
- Genomic Distribution: Distribution of fragment 5’ start sites across genomic features
- TSS Enrichment: Nucleosome positioning signal around transcription start sites
- End Motif Analysis: 5’ end 4-mer motif frequencies reflecting nuclease preferences
- Classification Prediction: Predicted group (ALS vs Ctrl) using pre-trained Random Forest model
Input Files
This notebook requires two input files:
- Bismark-aligned BAM file: Deduplicated BAM with XM methylation tags
- Bismark coverage file: Output from bismark_methylation_extractor (.bismark.cov.gz)
If you only have a Bismark-aligned BAM file, run the following command to extract methylation calls:
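A representative invocation for a paired-end, deduplicated Bismark BAM is shown below; the paths, sample name, and output directory are placeholders and should be adapted to your data:

```shell
# Extract methylation calls and produce a .bismark.cov.gz coverage file.
# --bedGraph generates the coverage file; --cytosine_report requires the
# Bismark genome folder used for alignment. Paths are placeholders.
bismark_methylation_extractor -p --comprehensive --gzip \
  --bedGraph --cytosine_report \
  --genome_folder /path/to/hg38_bismark_genome \
  -o methylation_extractor/ \
  sample.deduplicated.bam
```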
For detailed documentation, see: Bismark Methylation Extraction
Dependencies
This section lists all software dependencies required to run this analysis.
R Packages
CRAN Packages
| Package | Version | Purpose |
|---|---|---|
| ggplot2 | >=3.4.0 | Data visualization |
| dplyr | >=1.1.0 | Data manipulation |
| tidyr | >=1.3.0 | Data tidying |
| readr | >=2.1.0 | File I/O |
| purrr | >=1.0.0 | Functional programming |
| stringr | >=1.5.0 | String manipulation |
| tibble | >=3.2.0 | Modern data frames |
| patchwork | >=1.1.0 | Plot composition |
| here | >=1.0.0 | Project-relative paths |
| scales | >=1.2.0 | Axis formatting |
| randomForest | >=4.7 | Classification model |
| knitr | >=1.40 | Report generation |
| kableExtra | >=1.3.0 | Table formatting |
Bioconductor Packages
| Package | Version | Purpose |
|---|---|---|
| Rsamtools | >=2.14.0 | BAM file handling |
| cigarillo | >=1.0.0 | CIGAR string parsing |
| GenomicRanges | >=1.50.0 | Genomic intervals |
| IRanges | >=2.32.0 | Integer ranges |
| S4Vectors | >=0.36.0 | S4 vector infrastructure |
| Biostrings | >=2.66.0 | DNA sequence handling |
| BSgenome.Hsapiens.UCSC.hg38 | >=1.4.0 | Human reference genome (hg38) |
| GenomeInfoDb | >=1.34.0 | Genome metadata |
| TxDb.Hsapiens.UCSC.hg38.knownGene | >=3.16.0 | Gene annotations (hg38) |
| GenomicFeatures | >=1.50.0 | Genomic features |
| ChIPseeker | >=1.34.0 | Peak annotation |
| org.Hs.eg.db | >=3.16.0 | Human gene IDs |
| AnnotationDbi | >=1.60.0 | Annotation infrastructure |
| bsseq | >=1.34.0 | Bisulfite-seq analysis |
| Gviz | >=1.42.0 | Genomic track visualization |
External Tools
| Tool | Version | Purpose | Link |
|---|---|---|---|
| Bismark | >=0.24.0 | Bisulfite read alignment & methylation extraction | [GitHub](https://github.com/FelixKrueger/Bismark) |
| samtools | >=1.17 | BAM file manipulation (indexing) | [GitHub](https://github.com/samtools/samtools) |
| Quarto | >=1.3.0 | Document rendering | [quarto.org](https://quarto.org) |
# CRAN packages
install.packages(c("ggplot2", "dplyr", "tidyr", "readr", "purrr",
"stringr", "tibble", "patchwork", "here", "scales",
"randomForest", "knitr", "kableExtra"))
# Bioconductor packages
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c(
"Rsamtools", "cigarillo", "GenomicRanges", "IRanges", "S4Vectors",
"Biostrings", "BSgenome.Hsapiens.UCSC.hg38", "GenomeInfoDb",
"TxDb.Hsapiens.UCSC.hg38.knownGene", "GenomicFeatures",
"ChIPseeker", "org.Hs.eg.db", "AnnotationDbi",
"bsseq", "Gviz"
))
Setup and Input Validation
Code
# Load required packages
suppressPackageStartupMessages({
# CRAN packages
library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
library(purrr)
library(stringr)
library(tibble)
library(patchwork)
library(here)
library(knitr)
library(kableExtra)
library(scales)
library(randomForest)
# Bioconductor packages
library(Rsamtools)
library(cigarillo)
library(GenomicRanges)
library(IRanges)
library(S4Vectors)
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg38)
library(GenomeInfoDb)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(GenomicFeatures)
library(ChIPseeker)
library(org.Hs.eg.db)
library(AnnotationDbi)
library(bsseq)
library(Gviz)
})
# Load analysis functions (from same directory as this notebook)
# The functions file should be in the same directory as this .qmd file
source("cfDNA_analysis_functions.R")
# Set seed for reproducibility
set.seed(389)
Code
# Parameters from YAML
BAM_FILE <- params$bam_file
BISMARK_COV_FILE <- params$bismark_cov_file
SAMPLE_ID <- params$sample_id
OUTPUT_DIR <- params$output_dir
PROJECT_DIR <- params$project_dir
# Create output directory
if (!dir.exists(OUTPUT_DIR)) {
dir.create(OUTPUT_DIR, recursive = TRUE, showWarnings = FALSE)
}
# Temporary directory for intermediate files
TEMP_DIR <- file.path(OUTPUT_DIR, "temp")
if (!dir.exists(TEMP_DIR)) {
dir.create(TEMP_DIR, recursive = TRUE, showWarnings = FALSE)
}
# Load reference data
genome <- BSgenome.Hsapiens.UCSC.hg38
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
# Path to pre-trained model and data files (local to this notebook)
# These files are copied from the main analysis to ensure standalone operation
DATA_RDS_DIR <- "data/rds"
MODEL_PATH <- file.path(DATA_RDS_DIR, "nested_cv_results.rds")
TILE_SCALED_PATH <- file.path(DATA_RDS_DIR, "tile_scaled_matrix.rds")
SCALING_PARAMS_PATH <- file.path(DATA_RDS_DIR, "meththile_scaling_params.rds")
Code
# Validate input files
validation_results <- tibble::tibble(
File = c("BAM file", "Bismark coverage file", "Pre-trained model",
"Tile scaled matrix", "Scaling parameters"),
Path = c(BAM_FILE, BISMARK_COV_FILE, MODEL_PATH, TILE_SCALED_PATH, SCALING_PARAMS_PATH),
Exists = c(file.exists(BAM_FILE), file.exists(BISMARK_COV_FILE), file.exists(MODEL_PATH),
file.exists(TILE_SCALED_PATH), file.exists(SCALING_PARAMS_PATH))
)
knitr::kable(validation_results, caption = "Input file validation") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
kableExtra::row_spec(which(!validation_results$Exists), bold = TRUE, color = "white", background = "#E64B35")
| File | Path | Exists |
|---|---|---|
| BAM file | ../data/processed/bam_name_sorted/SRR13404379.deduplicated.sorted_ds10mill_chr21.name_sorted.bam | TRUE |
| Bismark coverage file | ../data/processed/methylation_extractor/SRR13404379.deduplicated.sorted_ds10mill_chr21.name_sorted/SRR13404379.deduplicated.sorted_ds10mill_chr21.name_sorted.bismark.cov.gz | TRUE |
| Pre-trained model | data/rds/nested_cv_results.rds | TRUE |
| Tile scaled matrix | data/rds/tile_scaled_matrix.rds | TRUE |
| Scaling parameters | data/rds/meththile_scaling_params.rds | TRUE |
QC Metrics
This section extracts quality control metrics from the BAM file, including mapping statistics, fragment size distribution, and methylation metrics.
Code
# Extract BAM metrics (this generates the fragment BED file as well)
message("Extracting BAM metrics...")
metrics <- extract_bam_metrics(
bam_path = BAM_FILE,
sample_id = SAMPLE_ID,
genome = genome,
chunk_size = 1e6,
frag_dir = TEMP_DIR
)
# Calculate summary statistics
summary_stats <- calculate_summary_stats(metrics)
Code
# Format key metrics for display
qc_display <- tibble::tibble(
Metric = c(
"Total Reads",
"Mapped Reads",
"Mapping Rate",
"Properly Paired",
"Proper Pair Rate",
"Duplicates",
"Duplicate Rate",
"Filtered Reads (MAPQ≥30)",
"Filter Pass Rate",
"Mean MAPQ",
"Number of Fragments",
"Mean Fragment Length",
"Median Fragment Length",
"Mode Fragment Length",
"GC Content",
"CpG Methylation",
"Bisulfite Conversion"
),
Value = c(
format(summary_stats$total_reads, big.mark = ","),
format(summary_stats$mapped_reads, big.mark = ","),
sprintf("%.2f%%", summary_stats$mapping_rate * 100),
format(summary_stats$properly_paired, big.mark = ","),
sprintf("%.2f%%", summary_stats$proper_pair_rate * 100),
format(summary_stats$duplicates, big.mark = ","),
sprintf("%.2f%%", summary_stats$duplicate_rate * 100),
format(summary_stats$filtered_reads, big.mark = ","),
sprintf("%.2f%%", summary_stats$filter_pass_rate * 100),
sprintf("%.1f", summary_stats$mean_mapq),
format(summary_stats$n_fragments, big.mark = ","),
sprintf("%.1f bp", summary_stats$mean_frag_length),
sprintf("%.0f bp", summary_stats$median_frag_length),
sprintf("%.0f bp", summary_stats$frag_mode),
sprintf("%.2f%%", summary_stats$gc_content * 100),
sprintf("%.2f%%", summary_stats$cpg_methylation * 100),
sprintf("%.2f%%", summary_stats$bisulfite_conversion * 100)
)
)
knitr::kable(qc_display, caption = paste("QC Summary for", SAMPLE_ID)) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
kableExtra::pack_rows("Alignment Statistics", 1, 9) %>%
kableExtra::pack_rows("Fragment Statistics", 10, 14) %>%
kableExtra::pack_rows("Methylation Statistics", 15, 17)
| Metric | Value |
|---|---|
| Alignment Statistics | |
| Total Reads | 125,496 |
| Mapped Reads | 125,496 |
| Mapping Rate | 100.00% |
| Properly Paired | 125,496 |
| Proper Pair Rate | 100.00% |
| Duplicates | 0 |
| Duplicate Rate | 0.00% |
| Filtered Reads (MAPQ≥30) | 108,714 |
| Filter Pass Rate | 86.63% |
| Fragment Statistics | |
| Mean MAPQ | 38.2 |
| Number of Fragments | 54,357 |
| Mean Fragment Length | 161.4 bp |
| Median Fragment Length | 156 bp |
| Mode Fragment Length | 156 bp |
| Methylation Statistics | |
| GC Content | 38.38% |
| CpG Methylation | 81.27% |
| Bisulfite Conversion | 99.74% |
QC Metrics Visualization
Code
# Prepare data for bar chart
qc_plot_data <- tibble::tibble(
metric = c("Total Reads (M)", "Filtered Reads (M)", "Mean MAPQ",
"Median Fragment (bp)", "GC Content (%)",
"CpG Methylation (%)", "Bisulfite Conv. (%)"),
value = c(
summary_stats$total_reads / 1e6,
summary_stats$filtered_reads / 1e6,
summary_stats$mean_mapq,
summary_stats$median_frag_length,
summary_stats$gc_content * 100,
summary_stats$cpg_methylation * 100,
summary_stats$bisulfite_conversion * 100
),
category = c("Reads", "Reads", "Quality", "Fragment", "Fragment", "Methylation", "Methylation")
) %>%
dplyr::mutate(
metric = factor(metric, levels = rev(metric)),
label = dplyr::case_when(
grepl("Reads", metric) ~ sprintf("%.1fM", value),
grepl("MAPQ", metric) ~ sprintf("%.1f", value),
grepl("Fragment", metric) & !grepl("%", metric) ~ sprintf("%.0f bp", value),
TRUE ~ sprintf("%.1f%%", value)
)
)
# Create horizontal bar chart
p_qc <- ggplot(qc_plot_data, aes(x = value, y = metric, fill = category)) +
geom_col(width = 0.7, alpha = 0.9) +
geom_text(aes(label = label), hjust = -0.1, size = 3.5, fontface = "bold") +
scale_fill_brewer(palette = "Set2") +
scale_x_continuous(expand = expansion(mult = c(0, 0.25))) +
labs(
title = paste("QC Metrics:", SAMPLE_ID),
subtitle = "Key quality indicators from BAM file analysis",
x = "Value",
y = NULL,
fill = "Category"
) +
theme_pub(base_size = 12) +
theme(
legend.position = "top",
panel.grid.major.y = element_blank()
)
p_qc
Fragmentation Analysis
Cell-free DNA exhibits characteristic insert size patterns reflecting nucleosome protection. This section analyzes the insert size distribution and genome-wide short/long fragment ratio.
Insert Size Distribution
Code
# Prepare fragment length data
frag_df <- tibble::tibble(
fragment_length = metrics$fragment_lengths
) %>%
dplyr::filter(fragment_length >= 80, fragment_length <= 450)
# Count fragments by length
counts_saw <- frag_df %>%
dplyr::count(fragment_length, name = "n")
# Calculate mode for annotation
calc_mode <- function(x) {
x <- x[is.finite(x)]
if (length(x) == 0) return(NA_integer_)
x <- as.integer(x)
tabulate(x, nbins = max(x)) |> which.max()
}
mode_bp <- calc_mode(metrics$fragment_lengths)
median_bp <- median(metrics$fragment_lengths, na.rm = TRUE)
short_n <- sum(metrics$fragment_lengths >= 100 & metrics$fragment_lengths <= 150)
long_n <- sum(metrics$fragment_lengths >= 151 & metrics$fragment_lengths <= 220)
short_long_ratio <- if (long_n > 0) short_n / long_n else NA
# Create sawtooth plot
p_sawtooth <- ggplot(counts_saw, aes(x = fragment_length, y = n)) +
geom_line(linewidth = 0.5, color = "gray20") +
geom_vline(xintercept = 167, linetype = "dashed", color = "#E64B35", linewidth = 0.6) +
geom_vline(xintercept = 334, linetype = "dashed", color = "#4DBBD5", linewidth = 0.6) +
annotate("text", x = 175, y = max(counts_saw$n) * 0.95,
label = "Mono-nucleosome\n(~167 bp)", hjust = 0, size = 3, color = "#E64B35") +
annotate("text", x = 342, y = max(counts_saw$n) * 0.5,
label = "Di-nucleosome\n(~334 bp)", hjust = 0, size = 3, color = "#4DBBD5") +
annotate("label", x = 380, y = max(counts_saw$n) * 0.85,
label = sprintf("Median: %d bp\nMode: %d bp\nS/L ratio: %.3f",
as.integer(median_bp), mode_bp, short_long_ratio),
hjust = 0, size = 3, fill = "white", label.size = 0.3) +
labs(
title = "cfDNA Fragment Size Distribution",
subtitle = paste("Sample:", SAMPLE_ID),
x = "Fragment Length (bp)",
y = "Count"
) +
theme_pub(base_size = 12)
p_sawtooth
Fragmentation Ratio Track
The short/long fragment ratio varies across the genome and can reveal nucleosome positioning differences.
Code
# Set up genomic bins (chr21 only for efficiency)
bin_size <- 2e6
chroms <- "chr21"
seqlens <- GenomeInfoDb::seqlengths(genome)[chroms]
seqlens <- seqlens[!is.na(seqlens)]
bins_gr <- GenomicRanges::tileGenome(
seqlengths = seqlens,
tilewidth = bin_size,
cut.last.tile.in.chrom = TRUE
)
bins_df <- tibble::tibble(
bin_id = seq_along(bins_gr),
chrom = as.character(GenomeInfoDb::seqnames(bins_gr)),
start = start(bins_gr),
end = end(bins_gr),
mid = as.integer(floor((start + end) / 2))
)
# GC content per bin
bin_seqs <- Biostrings::getSeq(genome, bins_gr)
freq <- Biostrings::letterFrequency(bin_seqs, letters = c("A", "C", "G", "T", "N"), as.prob = FALSE)
acgt <- rowSums(freq[, c("A", "C", "G", "T"), drop = FALSE])
gc <- (freq[, "G"] + freq[, "C"]) / pmax(acgt, 1)
n_frac <- freq[, "N"] / Biostrings::width(bin_seqs)
bins_df <- bins_df %>%
dplyr::mutate(gc = as.numeric(gc), n_frac = as.numeric(n_frac))
# GC filtering thresholds
gc_min <- 0.10
gc_max <- 0.85
n_frac_max <- 0.45
bins_keep_for_gc <- is.finite(bins_df$gc) &
bins_df$gc >= gc_min & bins_df$gc <= gc_max &
is.finite(bins_df$n_frac) & bins_df$n_frac <= n_frac_max
# Compute bin counts
fragment_bed_path <- file.path(TEMP_DIR, paste0(SAMPLE_ID, ".fragments.bed.gz"))
bin_tracks <- bin_counts_one_sample(
path = fragment_bed_path,
bins_gr = bins_gr,
bins_df = bins_df,
bins_keep_for_gc = bins_keep_for_gc
)
bin_tracks <- bin_tracks %>%
dplyr::left_join(bins_df, by = "bin_id")
Code
# Plot fragmentation ratio track
p_frag_track <- ggplot(bin_tracks, aes(x = mid / 1e6, y = short_long_ratio)) +
geom_hline(yintercept = 1, color = "gray70", linewidth = 0.35) +
geom_line(linewidth = 0.8, color = "#2E4A62") +
geom_point(size = 1.5, color = "#2E4A62", alpha = 0.7) +
scale_x_continuous(breaks = seq(0, 50, by = 10)) +
labs(
title = paste("Fragmentation Ratio Track:", SAMPLE_ID),
subtitle = sprintf("GC-corrected short(90-150bp)/long(151-220bp) ratio in %d Mb bins on chr21",
bin_size / 1e6),
x = "Genomic Position (Mb)",
y = "Short/Long Ratio"
) +
theme_pub(base_size = 12)
p_frag_track
Code
# Gviz visualization
if (requireNamespace("Gviz", quietly = TRUE)) {
genome_build <- "hg38"
chrom <- "chr21"
from_bp <- min(bins_df$start, na.rm = TRUE)
to_bp <- max(bins_df$end, na.rm = TRUE)
gr_bins <- GenomicRanges::GRanges(
seqnames = bins_df$chrom,
ranges = IRanges::IRanges(start = bins_df$start, end = bins_df$end)
)
y <- bin_tracks$short_long_ratio
y_lim <- range(y[is.finite(y)], na.rm = TRUE)
y_pad <- 0.15 * diff(y_lim)
y_lim <- c(y_lim[1] - y_pad, y_lim[2] + y_pad)
axis_track <- Gviz::GenomeAxisTrack(genome = genome_build, chromosome = chrom, name = "")
ratio_track <- Gviz::DataTrack(
range = gr_bins,
data = y,
genome = genome_build,
chromosome = chrom,
name = "S/L ratio",
type = "l",
col = "#2E4A62",
lwd = 2,
ylim = y_lim,
baseline = 1,
col.baseline = "gray45",
lty.baseline = 2,
lwd.baseline = 1,
cex.axis = 0.9,
cex.title = 0.9
)
ideo_track <- tryCatch(
Gviz::IdeogramTrack(genome = genome_build, chromosome = chrom),
error = function(e) NULL
)
tracks <- list(axis_track, ratio_track)
if (!is.null(ideo_track)) tracks <- c(list(ideo_track), tracks)
Gviz::plotTracks(
tracks,
from = from_bp,
to = to_bp,
background.title = "white",
col.title = "black",
cex.title = 0.95,
background.panel = "white",
col.axis = "black",
col.frame = "gray85"
)
}
Fragment Start Site Analysis
This section analyzes the genomic distribution of fragment 5’ start sites and their enrichment around transcription start sites.
Genomic Distribution
Code
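The console messages that follow are emitted by ChIPseeker::annotatePeak. A sketch of such a call is given here; `start_gr` is a hypothetical GRanges of fragment 5' start sites (the actual object name is an assumption), while txdb and org.Hs.eg.db are loaded in Setup:

```r
# Sketch only (assumes start_gr: GRanges of fragment 5' start sites;
# requires the Bioconductor packages loaded in Setup). Not evaluated here.
peak_anno <- ChIPseeker::annotatePeak(
  start_gr,
  tssRegion = c(-3000, 3000),   # promoter window around each TSS
  TxDb      = txdb,
  annoDb    = "org.Hs.eg.db"
)
```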
>> preparing features information... 2026-01-30 17:34:25
>> Using Genome: hg38 ...
>> identifying nearest features... 2026-01-30 17:34:26
>> calculating distance from peak to TSS... 2026-01-30 17:34:27
>> assigning genomic annotation... 2026-01-30 17:34:27
>> Using Genome: hg38 ...
>> Using Genome: hg38 ...
>> adding flank feature information from peaks... 2026-01-30 17:35:08
>> assigning chromosome lengths 2026-01-30 17:35:12
>> done... 2026-01-30 17:35:12
Code
# Prepare data for donut chart
annot_data <- start_annot %>%
tidyr::pivot_longer(
cols = c(Promoter, Exon, Intron, Three_UTR, Five_UTR, Distal_Intergenic, Downstream),
names_to = "annotation",
values_to = "n"
) %>%
dplyr::mutate(
pct = 100 * n / sum(n),
annotation = dplyr::case_when(
annotation == "Three_UTR" ~ "3' UTR",
annotation == "Five_UTR" ~ "5' UTR",
annotation == "Distal_Intergenic" ~ "Distal Intergenic",
TRUE ~ annotation
),
annotation = factor(annotation, levels = c("Promoter", "5' UTR", "Exon", "Intron",
"3' UTR", "Downstream", "Distal Intergenic"))
) %>%
dplyr::arrange(annotation) %>%
dplyr::mutate(
ymax = cumsum(pct),
ymin = dplyr::lag(ymax, default = 0),
label_pos = (ymin + ymax) / 2,
label = sprintf("%s\n%.1f%%", annotation, pct)
)
# Create donut chart
p_donut <- ggplot(annot_data, aes(ymax = ymax, ymin = ymin, xmax = 4, xmin = 2.5, fill = annotation)) +
geom_rect(color = "white", linewidth = 0.5) +
geom_text(aes(x = 3.25, y = label_pos, label = sprintf("%.1f%%", pct)),
size = 3, color = "white", fontface = "bold") +
coord_polar(theta = "y") +
xlim(c(1, 4)) +
scale_fill_brewer(palette = "Set2") +
annotate("text", x = 1, y = 0, label = paste("Total\n", format(start_annot$total_starts, big.mark = ",")),
size = 4, fontface = "bold") +
labs(
title = "Genomic Distribution of Fragment 5' Starts",
subtitle = paste("Sample:", SAMPLE_ID),
fill = "Annotation"
) +
theme_void(base_size = 12) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5, size = 14),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "right"
)
p_donut
TSS Enrichment
Code
# Get muscle-related genes on chr21
muscle_symbols <- c("COL6A1", "COL6A2")
muscle_map <- AnnotationDbi::select(
org.Hs.eg.db::org.Hs.eg.db,
keys = muscle_symbols,
keytype = "SYMBOL",
columns = c("SYMBOL", "ENTREZID")
) %>%
dplyr::filter(!is.na(.data$ENTREZID)) %>%
dplyr::mutate(ENTREZID = as.character(.data$ENTREZID))
genes_chr21 <- suppressMessages(GenomicFeatures::genes(txdb))
genes_chr21 <- genes_chr21[as.character(GenomeInfoDb::seqnames(genes_chr21)) == "chr21"]
genes_muscle <- genes_chr21[as.character(genes_chr21$gene_id) %in% unique(muscle_map$ENTREZID)]
# TSS windows for muscle genes
tss_points_m <- unique(GenomicRanges::promoters(genes_muscle, upstream = 0L, downstream = 1L))
tss_windows_m <- unique(GenomicRanges::promoters(genes_muscle, upstream = 2000L, downstream = 2000L))
tss_site_m <- start(tss_points_m)
tss_strand_m <- as.character(strand(tss_points_m))
# Compute TSS enrichment profile
tss_profile <- profile_tss_enrichment_starts(
fragment_bed_gz = fragment_bed_path,
tss_windows = tss_windows_m,
tss_site = tss_site_m,
tss_strand = tss_strand_m,
flank = 2000L
)
Code
# Smooth the profile
tss_profile_smooth <- tss_profile %>%
dplyr::mutate(
enrichment_smooth = moving_average(enrichment, k = 21),
enrichment_smooth = dplyr::if_else(is.na(enrichment_smooth), enrichment, enrichment_smooth)
)
p_tss <- ggplot(tss_profile_smooth, aes(x = rel_bp, y = enrichment_smooth)) +
geom_hline(yintercept = 1, color = "gray60", linewidth = 0.35) +
geom_vline(xintercept = 0, color = "gray30", linewidth = 0.5) +
geom_line(linewidth = 0.9, color = "#2E4A62") +
annotate("text", x = 50, y = max(tss_profile_smooth$enrichment_smooth, na.rm = TRUE) * 0.95,
label = "TSS", hjust = 0, size = 4, fontface = "bold") +
labs(
title = "TSS Enrichment of Fragment 5' Start Sites",
subtitle = paste("Muscle genes (COL6A1, COL6A2) on chr21 | Sample:", SAMPLE_ID),
x = "Distance to TSS (bp)",
y = "Normalized Fragment Density"
) +
theme_pub(base_size = 12)
p_tss
End Motif Analysis
Cell-free DNA fragments exhibit characteristic 5’ end motif patterns reflecting the sequence preferences of nucleases involved in cfDNA generation.
Code
# Extract 4-mer end motifs
motifs_all <- all_4mers()
valid_seqnames <- seqnames(genome)
motif_counts <- extract_4mer_counts_from_frag_bed(
frag_bed_gz = fragment_bed_path,
sample_id = SAMPLE_ID,
genome = genome,
valid_seqnames = valid_seqnames,
motifs_all = motifs_all
)
# Calculate frequencies
motif_freq <- motif_counts %>%
dplyr::mutate(
total = sum(count),
frequency = count / total
)
Top 20 End Motifs
Code
# Get top 20 motifs
top20_motifs <- motif_freq %>%
dplyr::arrange(desc(frequency)) %>%
dplyr::slice_head(n = 20) %>%
dplyr::mutate(motif = factor(motif, levels = rev(motif)))
p_top20 <- ggplot(top20_motifs, aes(x = frequency, y = motif)) +
geom_col(fill = "#3C5488", alpha = 0.9, width = 0.7) +
geom_text(aes(label = sprintf("%.4f", frequency)), hjust = -0.1, size = 3) +
scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Top 20 cfDNA 5' End Motifs (4-mer)",
subtitle = paste("Sample:", SAMPLE_ID),
x = "Frequency",
y = "Motif"
) +
theme_pub(base_size = 12) +
theme(panel.grid.major.y = element_blank())
p_top20
Motif Diversity Score
The Motif Diversity Score (MDS) is calculated as the normalized Shannon entropy of the 256 4-mer frequencies.
Code
# Calculate MDS
freq_vec <- motif_freq$frequency
entropy_bits <- shannon_entropy_bits(freq_vec)
max_entropy <- log2(256) # Maximum possible entropy for 256 motifs
mds <- entropy_bits / max_entropy
mds_display <- tibble::tibble(
Metric = c("Shannon Entropy (bits)", "Maximum Entropy (bits)", "Motif Diversity Score (MDS)"),
Value = c(sprintf("%.4f", entropy_bits), sprintf("%.4f", max_entropy), sprintf("%.4f", mds))
)
knitr::kable(mds_display, caption = "End Motif Diversity Metrics") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Metric | Value |
|---|---|
| Shannon Entropy (bits) | 7.6619 |
| Maximum Entropy (bits) | 8.0000 |
| Motif Diversity Score (MDS) | 0.9577 |
Code
# Create a simple gauge-like visualization for MDS
p_mds <- ggplot() +
geom_rect(aes(xmin = 0, xmax = 1, ymin = 0, ymax = 1), fill = "gray90") +
geom_rect(aes(xmin = 0, xmax = mds, ymin = 0, ymax = 1), fill = "#3C5488", alpha = 0.8) +
geom_text(aes(x = 0.5, y = 0.5, label = sprintf("MDS = %.4f", mds)),
size = 8, fontface = "bold", color = "white") +
geom_text(aes(x = 0.5, y = -0.3, label = "Shannon entropy of 256 4-mer frequencies (normalized)"),
size = 3.5, color = "gray40") +
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
scale_y_continuous(limits = c(-0.5, 1.2)) +
labs(title = paste("Motif Diversity Score:", SAMPLE_ID)) +
theme_void(base_size = 12) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5, size = 14, margin = ggplot2::margin(b = 10)),
axis.text.x = element_text(color = "gray30"),
axis.ticks.x = element_line(color = "gray30")
)
p_mds
Classification Prediction
This section uses the pre-trained Random Forest model to predict whether the sample belongs to the ALS or Control group based on MethTile (1 kb genomic tiles) methylation features.
Code
# Check if all model files exist
model_available <- file.exists(MODEL_PATH) &&
file.exists(TILE_SCALED_PATH) &&
file.exists(SCALING_PARAMS_PATH)
if (model_available) {
# Load pre-trained model and saved data
all_results <- readRDS(MODEL_PATH)
tile_scaled <- readRDS(TILE_SCALED_PATH)
scaling_params <- readRDS(SCALING_PARAMS_PATH)
# Get model result and selected features
model_result <- all_results[["MethTile"]]
selected_features <- model_result$selected_features
# Check if sample is in the training set
sample_in_training <- SAMPLE_ID %in% rownames(tile_scaled)
if (sample_in_training) {
# Use EXACT values from training (ensures perfect consistency)
message(sprintf("Sample %s found in training set. Using exact scaled values.", SAMPLE_ID))
new_features <- tile_scaled[SAMPLE_ID, selected_features, drop = FALSE]
} else {
# For samples NOT in training set, compute features and apply saved scaling
message(sprintf("Sample %s not in training set. Computing features from bismark coverage file.", SAMPLE_ID))
# Load methylation data from bismark coverage file
bs <- bsseq::read.bismark(
files = BISMARK_COV_FILE,
rmZeroCov = TRUE,
strandCollapse = TRUE,
verbose = FALSE
)
# Generate 1kb tiles on chr21 (same as training)
tile_gr <- GenomicRanges::tileGenome(
seqlengths = GenomeInfoDb::seqlengths(genome)["chr21"],
tilewidth = 1000,
cut.last.tile.in.chrom = TRUE
)
# Calculate mean methylation per tile
meth_by_tile <- bsseq::getMeth(bs, regions = tile_gr, type = "raw", what = "perRegion")
# Create tile names (same format as training: chr21_start_end)
tile_names <- sprintf("%s_%d_%d",
as.character(GenomicRanges::seqnames(tile_gr)),
GenomicRanges::start(tile_gr),
GenomicRanges::end(tile_gr))
# Convert to named vector (percent methylation, same as training)
tile_meth <- setNames(as.numeric(meth_by_tile[, 1]) * 100, tile_names)
# Get raw values for selected features
new_features_raw <- tile_meth[selected_features]
# Get scaling parameters for selected features
feature_centers <- scaling_params$center[selected_features]
feature_scales <- scaling_params$scale[selected_features]
# Impute missing values with 0 (center value after scaling)
# This is consistent with training where missing values result in 0 after imputation + scaling
new_features_raw[is.na(new_features_raw)] <- feature_centers[is.na(new_features_raw)]
# Apply z-score scaling using exact training parameters
new_features_scaled <- (new_features_raw - feature_centers) / feature_scales
new_features_scaled[is.nan(new_features_scaled)] <- 0
# Create matrix for prediction
new_features <- matrix(new_features_scaled, nrow = 1,
dimnames = list(SAMPLE_ID, selected_features))
}
# Make prediction using the model
prediction <- predict_single_sample(new_features, model_result)
# Store for reporting
prediction_available <- TRUE
prediction_source <- if (sample_in_training) "training_set" else "computed"
} else {
prediction_available <- FALSE
missing_files <- c()
if (!file.exists(MODEL_PATH)) missing_files <- c(missing_files, "Pre-trained model")
if (!file.exists(TILE_SCALED_PATH)) missing_files <- c(missing_files, "Tile scaled matrix")
if (!file.exists(SCALING_PARAMS_PATH)) missing_files <- c(missing_files, "Scaling parameters")
message("Missing files for classification: ", paste(missing_files, collapse = ", "))
}
Code
if (prediction_available) {
# Display prediction results
pred_class <- prediction$predicted_class
pred_prob <- prediction$probabilities
pred_color <- if (pred_class == "ALS") "#E64B35" else "#4DBBD5"
# Show feature source
source_msg <- if (prediction_source == "training_set") {
"Features extracted from training set (exact match)"
} else {
"Features computed from bismark coverage file"
}
message("Feature source: ", source_msg)
}
Code
if (prediction_available) {
# Create prediction visualization
prob_data <- tibble::tibble(
Group = c("Ctrl", "ALS"),
Probability = c(pred_prob$Ctrl, pred_prob$ALS)
) %>%
dplyr::mutate(Group = factor(Group, levels = c("Ctrl", "ALS")))
# Create label for binary prediction
confidence <- max(pred_prob$Ctrl, pred_prob$ALS) * 100
prediction_label <- sprintf("Prediction: %s (%.1f%% confidence)", pred_class, confidence)
p_pred <- ggplot(prob_data, aes(x = Group, y = Probability, fill = Group)) +
geom_col(width = 0.6, alpha = 0.9) +
geom_text(aes(label = sprintf("%.1f%%", Probability * 100)),
vjust = -0.5, size = 5, fontface = "bold") +
geom_hline(yintercept = 0.5, linetype = "dashed", color = "gray50") +
annotate("label", x = 1.5, y = 0.95, label = prediction_label,
size = 5, fontface = "bold", fill = pred_color, color = "white",
label.padding = unit(0.5, "lines"), label.r = unit(0.3, "lines")) +
scale_fill_manual(values = COLORS) +
scale_y_continuous(limits = c(0, 1.15), labels = scales::percent) +
labs(
title = "Classification Prediction Result",
subtitle = paste("Sample:", SAMPLE_ID, "| Model: MethTile (1kb) | Features:",
length(selected_features), "tiles"),
x = NULL,
y = "Prediction Probability"
) +
theme_pub(base_size = 14) +
theme(legend.position = "none")
print(p_pred)
}
Code
if (prediction_available) {
pred_table <- tibble::tibble(
Metric = c("Predicted Class", "Ctrl Probability", "ALS Probability", "Confidence"),
Value = c(
pred_class,
sprintf("%.2f%%", pred_prob$Ctrl * 100),
sprintf("%.2f%%", pred_prob$ALS * 100),
sprintf("%.2f%%", max(pred_prob$Ctrl, pred_prob$ALS) * 100)
)
)
knitr::kable(pred_table, caption = "Classification Prediction Results") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
kableExtra::row_spec(1, bold = TRUE, color = "white", background = pred_color)
}
| Metric | Value |
|---|---|
| Predicted Class | ALS |
| Ctrl Probability | 3.00% |
| ALS Probability | 97.00% |
| Confidence | 97.00% |
Summary
Code
# Compile all key results
summary_results <- tibble::tibble(
Category = c(
rep("Quality Control", 5),
rep("Fragment Analysis", 4),
rep("End Motifs", 2),
if (prediction_available) "Classification" else character(0)
),
Metric = c(
"Total Reads", "Mapping Rate", "Bisulfite Conversion", "CpG Methylation", "Mean MAPQ",
"Median Fragment Length", "Mode Fragment Length", "Short/Long Ratio", "TSS Enrichment Score",
"Top Motif", "Motif Diversity (MDS)",
if (prediction_available) "Predicted Group" else character(0)
),
Value = c(
format(summary_stats$total_reads, big.mark = ","),
sprintf("%.2f%%", summary_stats$mapping_rate * 100),
sprintf("%.2f%%", summary_stats$bisulfite_conversion * 100),
sprintf("%.2f%%", summary_stats$cpg_methylation * 100),
sprintf("%.1f", summary_stats$mean_mapq),
sprintf("%d bp", as.integer(summary_stats$median_frag_length)),
sprintf("%d bp", summary_stats$frag_mode),
sprintf("%.3f", short_long_ratio),
sprintf("%.2f", max(tss_profile$enrichment, na.rm = TRUE)),
as.character(top20_motifs$motif[1]),
sprintf("%.4f", mds),
if (prediction_available) sprintf("%s (%.1f%%)", pred_class, max(pred_prob) * 100) else character(0)
)
)
knitr::kable(summary_results, caption = paste("Analysis Summary for", SAMPLE_ID)) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
kableExtra::pack_rows("Quality Control", 1, 5) %>%
kableExtra::pack_rows("Fragment Analysis", 6, 9) %>%
kableExtra::pack_rows("End Motifs", 10, 11) %>%
{if (prediction_available) kableExtra::pack_rows(., "Classification", 12, 12) else .}
| Category | Metric | Value |
|---|---|---|
| Quality Control | ||
| Quality Control | Total Reads | 125,496 |
| Quality Control | Mapping Rate | 100.00% |
| Quality Control | Bisulfite Conversion | 99.74% |
| Quality Control | CpG Methylation | 81.27% |
| Quality Control | Mean MAPQ | 38.2 |
| Fragment Analysis | ||
| Fragment Analysis | Median Fragment Length | 156 bp |
| Fragment Analysis | Mode Fragment Length | 156 bp |
| Fragment Analysis | Short/Long Ratio | 0.642 |
| Fragment Analysis | TSS Enrichment Score | -Inf |
| End Motifs | ||
| End Motifs | Top Motif | AAAA |
| End Motifs | Motif Diversity (MDS) | 0.9577 |
| Classification | ||
| Classification | Predicted Group | ALS (97.0%) |
Methods
Data Processing
The BAM file was processed using custom R functions that implement the following pipeline:
- Read Filtering: Reads were filtered to retain properly paired, non-duplicate alignments with MAPQ ≥ 30.
- Fragment Extraction: Fragment coordinates were derived from paired-end alignments using the union span of mate alignments.
- GC Content: Fragment GC content was calculated as (G+C)/(A+C+G+T) excluding N bases.
- Methylation Extraction: CpG methylation was extracted from Bismark XM tags or coverage files.
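The GC Content formula above can be spot-checked with a self-contained base-R sketch (toy sequence, not from the sample):

```r
# GC content as (G+C)/(A+C+G+T), ignoring N bases (toy example).
frag_seq <- "ACGTNNGGCC"
bases <- strsplit(frag_seq, "")[[1]]
acgt  <- bases[bases %in% c("A", "C", "G", "T")]   # drop the two Ns
gc    <- sum(acgt %in% c("G", "C")) / length(acgt) # 6 of 8 -> 0.75
```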
Fragmentation Analysis
- Insert Size Distribution: Fragment lengths were computed from paired-end alignment coordinates.
- Short/Long Ratio: Ratio of fragments 90-150 bp (short) to 151-220 bp (long).
- GC Correction: LOESS regression was used to correct for GC bias in fragment counts.
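A minimal illustration of LOESS-based GC correction on simulated bin counts follows; the data are toy values, and the report's actual correction presumably lives inside bin_counts_one_sample():

```r
# Illustrative LOESS GC-correction (simulated bins, not the sample's data).
set.seed(42)
gc     <- runif(200, 0.30, 0.60)                            # per-bin GC fraction
counts <- rpois(200, lambda = 100 * (1 + 2 * (gc - 0.45)))  # GC-biased counts
fit       <- loess(counts ~ gc, span = 0.75)   # smooth count-vs-GC trend
expected  <- predict(fit, gc)                  # expected count at each bin's GC
corrected <- counts * mean(counts) / expected  # divide out the GC trend
```

Dividing each bin by its LOESS-expected count (then rescaling to the mean) removes the systematic dependence of counts on GC while preserving relative differences between bins.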
End Motif Analysis
- Motif Extraction: 4-mer sequences were extracted from both 5’ ends of each fragment (strand-aware).
- Motif Diversity Score: Calculated as normalized Shannon entropy: MDS = H(p) / log₂(256).
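The MDS arithmetic can be checked with a self-contained base-R sketch; shannon_entropy_bits is re-implemented here for illustration, and the frequencies are toy values rather than the sample's:

```r
# MDS = H(p) / log2(256): normalized Shannon entropy of 4-mer frequencies.
shannon_entropy_bits <- function(p) {
  p <- p[p > 0]           # terms with p = 0 contribute nothing
  -sum(p * log2(p))
}
uniform <- rep(1 / 256, 256)
mds_uniform <- shannon_entropy_bits(uniform) / log2(256)  # exactly 1
set.seed(1)
skewed <- prop.table(rexp(256)^2)                         # toy skewed frequencies
mds_skewed <- shannon_entropy_bits(skewed) / log2(256)    # strictly below 1
```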
Classification
- Model: Random Forest classifier trained on ALS vs Control cohort data.
- Features: MethTile methylation levels (mean methylation in 1 kb genomic tiles, z-score scaled with training parameters).
- Validation: Nested cross-validation (LOOCV outer, 5-fold inner) with Boruta feature selection.
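The feature scaling applied to out-of-training samples (see the classification code above) can be illustrated with toy values; the tile names and numbers here are made up:

```r
# Apply saved training center/scale to a new sample's tile methylation.
center  <- c(chr21_100_200 = 70, chr21_300_400 = 55)   # training means (%)
scale_  <- c(chr21_100_200 = 10, chr21_300_400 = 5)    # training SDs
new_raw <- c(chr21_100_200 = 80, chr21_300_400 = NA)   # NA: tile uncovered
new_raw[is.na(new_raw)] <- center[is.na(new_raw)]      # impute -> z = 0
z <- (new_raw - center) / scale_                       # z = c(1, 0)
```

Imputing missing tiles with the training mean before scaling guarantees they contribute a z-score of exactly 0, matching the training-time treatment of missing values.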
Software
This analysis was performed using R version 4.5.1 (2025-06-13) with Bioconductor packages for genomic analysis. All core algorithms are identical to those used in the cohort analysis to ensure reproducibility.
Report generated on 2026-01-30 17:46:20.68722